Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preview dataset #1288

Merged
merged 47 commits into from
Mar 24, 2023
Merged

Preview dataset #1288

merged 47 commits into from
Mar 24, 2023

Conversation

Huongg
Copy link
Contributor

@Huongg Huongg commented Mar 15, 2023

Description

Fixes for #907. For now, every time the user clicks on CSV, Excel or Parquet Dataset in viz, it will load the first 40 rows in the metadata panel

Design:
Link to the journey and behaviour description
Link to the metadata side panel

Development notes

To test this locally, ensure you clone and pull the latest changes from this branch https://github.com/kedro-org/kedro-plugins/tree/preview-csv-dataset

Then cd to location of kedro-plugins/kedro-datasets, then run pip install -e . To check, run pip list to see if kedro-dataset and kedro-viz are pointed to your local machine
Once done, go back to kedro-viz/demo-project run kedro run then kedro viz. It should show you all the changes from both repos.

WIP: tests to be added to cover both changes, here and in kedro-plugins

Screen.Recording.2023-03-16.at.11.02.58.mov

QA notes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

@Huongg Huongg marked this pull request as ready for review March 17, 2023 09:53
@@ -20,6 +20,9 @@

logger = logging.getLogger(__name__)

PREVIEW_DATASETS = ["pandas.csv_dataset.CSVDataSet",
"pandas.parquet_dataset.ParquetDataSet", "pandas.excel_dataset.ExcelDataSet"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting Spark and launch would be a really nice stretch goal, but not a dealbreaker

if self.type in PREVIEW_DATASETS:
# If the kedro-datasets is on the latest and does have the _preview
if (hasattr(dataset, '_preview')):
self.preview = dataset._preview(40)
Copy link
Contributor

@datajoely datajoely Mar 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make this configurable here?
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our telemetry, hardly anyone uses this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting. I would argue this is a slightly more compelling thing for users to actually change, but it's also something we can wait to see if users start asking for :)

Copy link
Contributor Author

@Huongg Huongg Mar 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now done @datajoely 😄 I just included this today, by default this feature is always on unless user chooses to change it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it toggled or can they pass the preview number of rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is just toggled on and off at the moment, with the number of rows sets to 40

Copy link
Contributor

@datajoely datajoely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hugely excited about this :)

@Huongg Huongg requested a review from rashidakanchwala March 20, 2023 15:49
@antonymilne antonymilne self-requested a review March 21, 2023 10:52
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @Huongg! ⭐

I have many comments, but most importantly I think you should test three things before you make any changes:

  1. what happens if you change your code in _preview in kedro-datasets to raise an exception (easiest way is to just put in the code 1/0?
  2. what happens if you run it on a dataset that's pandas.CSVDataSet but the file doesn't exist?
  3. what happens if you run it on a dataset that's `pandas.CSVDataSet but with some data missing?

I suspect the first two of these will result in no metadata panel loading for the dataset and that the 3rd test case will work.

Assuming I'm right here, you should then:

  1. Write a test that fails test case 2
  2. Make the changes I suggest
  3. Test the cases that failed before and hopefully they pass now

I have more ideas for where we should go with this feature, but I'll post it on a separate issue. I think we're also going to see cases which aren't currently handled well e.g. if the rows have labels as well as columns, but I'm happy to release this as an MVP for now.

package/kedro_viz/models/flowchart.py Outdated Show resolved Hide resolved
package/tests/test_models/test_flowchart.py Outdated Show resolved Hide resolved
src/components/metadata/metadata.js Outdated Show resolved Hide resolved
@Huongg Huongg requested a review from antonymilne March 23, 2023 09:01
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work 🎉

Just to double check, what now happens in these three cases?

  1. change your code in _preview in kedro-datasets to raise an exception (easiest way is to just put in the code 1/0)?
  2. a dataset that's pandas.CSVDataSet but the file doesn't exist?
  3. a dataset that's pandas.CSVDataSet but with some data missing?

@@ -13,6 +13,7 @@ Please follow the established format:
- Remove metrics plots from metadata panel and add link to the plots on Experiment tracking. (#1268)
- Link plot and JSON dataset names from experiment tracking to the flowchart. (#1165)
- Bump minimum version of React from 16.8.6 to 17.0.2. (#1282)
- Show preview of data in metadata panel. (#1288)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth putting this at the top of the list since to users it's much more interesting than the other changes 😀 (maybe the react version point should go under bug fixes and other changes?)

Maybe also worth saying "preview of pandas.CSVDataSet and pandas.ExcelDataSet" too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey i like the idea of including the "preview of pandas.CSVDataSet and pandas.ExcelDataSet" here too. Even though it will also mention it in our Release highlight in the UI, I guess no harm to mention here again.

I think the React version might be a major change actually but maybe @tynandebold can confirm this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The React version change will make this a major release, so probably ok to leave it where it is.

src/components/metadata-modal/metadata-modal.js Outdated Show resolved Hide resolved
@Huongg
Copy link
Contributor Author

Huongg commented Mar 23, 2023

Really nice work 🎉

Just to double check, what now happens in these three cases?

  1. change your code in _preview in kedro-datasets to raise an exception (easiest way is to just put in the code 1/0)?
  2. a dataset that's pandas.CSVDataSet but the file doesn't exist?
  3. a dataset that's pandas.CSVDataSet but with some data missing?

hey @AntonyMilneQB thank you. So to confirm:

  1. it will throw error as .... could not be previewed. Full exception: ZeroDivisionError: division by zero
  2. it will also throw error
'reviews' could not be previewed. Full exception: DataSetError: Failed while loading data from data set                              flowchart.py:600
CSVDataSet(filepath=/Users/Huong_Nguyen/Documents/dev/kedro-viz/demo-project/data/01_raw/reviews.csv, load_args={'nrows': 40},protocol=file, save_args={'index': False}). No columns to parse from file
  1. this one works normally, it shows some empty field in the table as the screenshot below

Screenshot 2023-03-23 at 17 58 34

Are these what you expected?

@antonymilne
Copy link
Contributor

Yes, all as I expected thank you! 👍

Copy link
Contributor

@rashidakanchwala rashidakanchwala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing!!!!!! Thanks Huong!

@Huongg Huongg merged commit f8c2be5 into main Mar 24, 2023
@Huongg Huongg deleted the preview-dataset branch March 24, 2023 12:53
@tynandebold tynandebold mentioned this pull request Mar 24, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants